import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('Mall_Customers.csv')
df.shape
(200, 5)
df.head()
|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
df.columns
Index(['CustomerID', 'Gender', 'Age', 'Annual Income (k$)',
'Spending Score (1-100)'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              200 non-null    int64
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64
 3   Annual Income (k$)      200 non-null    int64
 4   Spending Score (1-100)  200 non-null    int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
#Changing the name of some columns
df = df.rename(columns={'Annual Income (k$)': 'Annual_income', 'Spending Score (1-100)': 'Spending_score'})
#Looking for null values
df.isna().sum()
CustomerID        0
Gender            0
Age               0
Annual_income     0
Spending_score    0
dtype: int64
#Encoding Gender as numeric (Female = 0, Male = 1)
df['Gender'] = df['Gender'].replace(['Female', 'Male'], [0, 1])
#Checking values have been replaced properly
df.Gender
0 1
1 1
2 0
3 0
4 0
..
195 0
196 0
197 1
198 1
199 1
Name: Gender, Length: 200, dtype: int64
Perform data preprocessing.
#Density estimation of each feature using histograms with a KDE overlay
plt.figure(1, figsize=(15, 6))
feature_list = ['Age', 'Annual_income', 'Spending_score']
pos = 1
for i in feature_list:
    plt.subplot(1, 3, pos)
    plt.subplots_adjust(hspace=0.5, wspace=0.5)
    sns.histplot(df[i], bins=20, kde=True)  # sns.distplot is deprecated
    pos = pos + 1
plt.show()
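One preprocessing step worth considering (not performed in this notebook): K-means relies on Euclidean distance, so features measured on larger scales can dominate the clustering. A minimal sketch of standardization with `StandardScaler`, shown on a small synthetic stand-in for `df[["Age", "Annual_income", "Spending_score"]]`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the first few rows of the mall dataset
X_demo = np.array([[19.0, 15.0, 39.0],
                   [21.0, 15.0, 81.0],
                   [20.0, 16.0, 6.0],
                   [23.0, 16.0, 77.0],
                   [31.0, 17.0, 40.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)  # each column now has mean 0, std 1
```

Because the three features here already share roughly similar ranges, the unscaled clustering below still behaves reasonably, but scaling makes the choice explicit.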
Perform k-means clustering using sklearn with an arbitrary number of clusters.
#Pairplot with variables we want to study
sns.pairplot(df, vars=["Age", "Annual_income", "Spending_score"], kind ="reg", hue = "Gender", palette="husl", markers = ['o','D'])
<seaborn.axisgrid.PairGrid at 0x237fd0ca4f0>
Draw the inferences you find from the clustering process.
Age and Annual Income
sns.lmplot(x = "Age", y = "Annual_income", data = df, hue = "Gender")
<seaborn.axisgrid.FacetGrid at 0x237fdc92910>
Draw the inferences you find from the clustering process.
Spending Score and Annual Income
sns.lmplot(x = "Annual_income", y = "Spending_score", data = df, hue = "Gender")
<seaborn.axisgrid.FacetGrid at 0x237fdcb66a0>
Age and Spending Score
sns.lmplot(x = "Age", y = "Spending_score", data = df, hue = "Gender")
<seaborn.axisgrid.FacetGrid at 0x237fdcb8190>
#Creating values for the elbow
X = df.loc[:,["Age", "Annual_income", "Spending_score"]]
inertia = []
k = range(1,20)
for i in k:
    means_k = KMeans(n_clusters=i, n_init=10, random_state=0)
    means_k.fit(X)
    inertia.append(means_k.inertia_)
#Plotting the elbow
plt.plot(k, inertia, 'bo-')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
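The elbow is read off the plot visually, so a numeric cross-check can help. One common option is the silhouette score (range -1 to 1, higher is better). A sketch on synthetic blob data, since the metric only needs the feature matrix and the fitted labels; on the mall data you would pass `X` instead:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with a known cluster structure
X_demo, _ = make_blobs(n_samples=200, centers=5, random_state=0)

# Silhouette is undefined for k=1, so start at 2
scores = {}
for k_try in range(2, 8):
    km = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit(X_demo)
    scores[k_try] = silhouette_score(X_demo, km.labels_)

best_k = max(scores, key=scores.get)  # k with the highest silhouette
```

A k where the silhouette peaks and the elbow bends is a reasonable choice; for the mall data both point to around five clusters.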
Perform K-means clustering using sklearn with the optimal number of clusters.
#Training kmeans with 5 clusters
means_k = KMeans(n_clusters=5, n_init=10, random_state=0)
means_k.fit(X)
labels = means_k.labels_
centroids = means_k.cluster_centers_
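A quick sanity check after fitting is to tabulate the cluster sizes alongside the centroid coordinates. The sketch below uses a synthetic stand-in matrix with the same three columns; in the notebook you would build `summary` from `means_k` and `X` directly:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for X = df[["Age", "Annual_income", "Spending_score"]]
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * [10.0, 25.0, 25.0] + [40.0, 60.0, 50.0]

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_demo)

# One row per cluster: centroid coordinates plus member count
summary = pd.DataFrame(km.cluster_centers_,
                       columns=["Age", "Annual_income", "Spending_score"])
summary["size"] = np.bincount(km.labels_, minlength=5)
```

Very small or very large clusters in this table are an early warning that k or the feature set needs revisiting.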
Which attributes are strongly correlated with Spending Score?
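One direct way to answer this is to compute Pearson correlations of the numeric columns against `Spending_score`. The sketch below runs on synthetic stand-in data (with spending deliberately constructed to track age inversely, purely for illustration); on the real data the single line `df.corr(numeric_only=True)["Spending_score"]` gives the answer:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: spending score is built to decrease with age
rng = np.random.default_rng(0)
age = rng.integers(18, 70, size=200)
income = rng.integers(15, 140, size=200)
score = 100 - age + rng.normal(0, 5, size=200)
demo = pd.DataFrame({"Age": age, "Annual_income": income,
                     "Spending_score": score})

# Correlation of every numeric column with Spending_score
corr = demo.corr(numeric_only=True)["Spending_score"].drop("Spending_score")
```

On the mall dataset this shows a noticeable negative correlation between `Age` and `Spending_score`, consistent with the lmplot above.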
import plotly as py
import plotly.graph_objs as go
Apply K-means clustering using sklearn with the optimal number of clusters along with highly correlated features.
#Creating a 3D plot to view the data separation produced by K-means
trace1 = go.Scatter3d(
    x=X['Spending_score'],
    y=X['Annual_income'],
    z=X['Age'],
    mode='markers',
    marker=dict(
        color=labels,
        size=10,
        line=dict(
            color=labels,
        ),
        opacity=0.9
    )
)
layout = go.Layout(
    title='Clusters',
    scene=dict(
        xaxis=dict(title='Spending_score'),
        yaxis=dict(title='Annual_income'),
        zaxis=dict(title='Age')
    )
)
fig = go.Figure(data=trace1, layout=layout)
py.offline.iplot(fig)
Draw the inferences you find from the clustering process.
After plotting the K-means results on this 3D graphic, we can identify and describe the five clusters that have been created:
Yellow Cluster - Young customers with low to moderate annual income but high spending scores.
Purple Cluster - Fairly young customers with high incomes who also spend heavily.
Pink Cluster - Customers of all ages with moderate incomes and moderate spending scores.
Orange Cluster - Customers, mostly between thirty and sixty years old, with high incomes but very low spending scores.
Blue Cluster - Customers of all ages with low incomes who spend little.
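These verbal descriptions can be checked numerically by attaching the cluster labels to the data and averaging each feature per cluster. The sketch below uses synthetic stand-in data; in the notebook you would run `X.assign(cluster=labels).groupby("cluster").mean()` with the fitted `labels`:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the three clustering features
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 70, size=200),
    "Annual_income": rng.integers(15, 140, size=200),
    "Spending_score": rng.integers(1, 100, size=200),
})

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(demo)

# Mean Age / income / spending score per cluster
profile = demo.assign(cluster=labels).groupby("cluster").mean()
```

Reading each row of `profile` against the color descriptions above confirms (or corrects) labels such as "young, high spending" or "high income, low spending".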
After developing a solution for this problem, we have come to the following conclusions:
K-means clustering is a powerful technique for achieving a sensible customer segmentation, and segmentation is a good way to understand how different customers behave and to plan a marketing strategy accordingly. There is little difference between the spending scores of women and men, which suggests that shopping behaviour is fairly similar across genders. The clustering graphic shows clearly that young people spend the most in malls; they are therefore the main marketing target, and deeper studies of their interests may lead to higher profits. Although young customers appear to spend the most, other groups also matter: the pink cluster, roughly what we would call the "middle class", seems to be the largest one. Promoting discounts in some shops may appeal to customers who currently spend little and encourage them to spend more.